Statistics for Bioengineering - Week 2

Author

Bill Cresko

Week 2: EDA, Probability, and Experimental Design

Week 2 Topics

  • Exploratory data analysis
  • Data visualization
  • Parameters & statistics
  • Probability distributions
  • Estimates & confidence intervals
Note

Readings: Chapters 9-15

Packages for This Week

# Install new packages (run once)
install.packages("nycflights13")

# Load packages
library(tidyverse)    # ggplot2, dplyr, tidyr for data wrangling & visualization
library(nycflights13) # NYC flights data for dplyr examples
Note

Dataset from package: flights (336,776 flights departing NYC in 2013) - used for dplyr practice

Exploratory Data Analysis

What is EDA?

  • First step in any data analysis
  • Understand structure and patterns in your data
  • Identify outliers, missing values, errors
  • Generate hypotheses for formal testing
Always visualize your data before running statistical tests!

Data Wrangling with Tidyverse

Tidyverse Family of Packages

Figure 1: Tidyverse packages overview
library(tidyverse)

A tibble - a tidyverse dataframe


Types of variables in a tibble

Figure 2: dplyr functions

Key dplyr Verbs

  • dplyr is a tidyverse package that provides functions (also called verbs) for wrangling data.
Verb Purpose Example
filter() Subset rows by values filter(df, x > 5)
select() Subset columns by name select(df, col1, col2)
arrange() Reorder rows arrange(df, x)
mutate() Create new columns mutate(df, z = x + y)
summarise() Collapse to summary summarise(df, mean = mean(x))

Filter, Arrange, and Select

# Filter rows
filter(flights, month == 11 | month == 12)

# Arrange rows
arrange(flights, year, month, day)

# Select columns
select(flights, year, month, day)

Conditional Operators

Operator Meaning
== Equals exactly
<, <= Less than (or equal)
>, >= Greater than (or equal)
!= Not equal to
! NOT operator
& AND operator
| OR operator
%in% Belongs to set
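
These operators combine naturally inside filter(). A short sketch using the flights data loaded above (column names as in nycflights13):

```r
library(nycflights13)
library(dplyr)

# Summer flights (%in%) that left on time or early (<=), combined with AND
summer_ontime <- flights |>
  filter(month %in% c(6, 7, 8) & dep_delay <= 0)

summer_ontime
```

Inside filter(), separating conditions with a comma is equivalent to combining them with &.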

Mutate and Transmute

# Add new columns based on existing ones
mutate(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# Using pipes
flights |>
  mutate(
    gain = arr_delay - dep_delay,
    speed = distance / air_time * 60
  )
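
transmute() works like mutate() but keeps only the columns it creates — a minimal sketch:

```r
library(nycflights13)
library(dplyr)

# transmute() returns a tibble containing only the new columns
res <- flights |>
  transmute(
    gain = arr_delay - dep_delay,
    hours = air_time / 60,
    gain_per_hour = gain / hours
  )

names(res)   # only the three new columns survive
```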

Group By and Summarise

# Group by categorical variable
by_day <- group_by(flights, year, month, day)

# Summarise within groups
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
Warning

Aggregation functions return NA if any input value is NA. Use na.rm = TRUE to remove missing values.
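
In practice, grouped summaries are usually written as one pipeline. A sketch combining group_by(), summarise(), and arrange():

```r
library(nycflights13)
library(dplyr)

# One pipeline: group by month, summarise, then sort by mean delay
monthly_delay <- flights |>
  group_by(month) |>
  summarise(
    mean_delay = mean(dep_delay, na.rm = TRUE),  # na.rm = TRUE avoids NA results
    n = n()                                      # observations per group
  ) |>
  arrange(desc(mean_delay))

monthly_delay
```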

Other Useful dplyr Functions

Function Purpose
slice() Subset rows by position
pull() Extract column as vector
count() Count observations
distinct() Unique observations
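
Each of these in one line, again with the flights data:

```r
library(nycflights13)
library(dplyr)

count(flights, carrier, sort = TRUE)   # observations per carrier, most frequent first
distinct(flights, origin)              # the unique origin airports
slice(flights, 1:5)                    # rows 1 through 5, by position
head(pull(flights, dep_delay))         # a column extracted as a plain vector
```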

R Exercise: Data Wrangling

Exercise
  1. Read in Week1b_Stickle_RNAseq.tsv or knee_injury.csv dataset
  2. Select a subset of categorical variables + quantitative variables
  3. Mutate to create square root transformed versions
  4. Summarise mean and SD grouped by categories (such as sex, population, treatment)
  5. Write results to a .csv file

Data Visualization with ggplot2

Introduction to ggplot2

  • Part of the tidyverse suite of packages
  • GG stands for “Grammar of Graphics”
  • Start with ggplot(), supply dataset and aesthetic mapping with aes()
  • Add geometry layers with geom_*() functions

Grammar of Graphics

Scatterplots

Scatterplots Code

ggplot(mpg, aes(displ, hwy, color = class)) +
  geom_point(size = 4, alpha = 0.6)

Boxplots


Boxplots Code

ggplot(mpg, aes(manufacturer, hwy, colour = class)) +
  geom_boxplot() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust=1))

Common ggplot Geoms

Geom Purpose Data Type
geom_point() Scatterplots Two continuous
geom_line() Line plots Continuous over ordered
geom_bar() Bar charts Categorical counts
geom_histogram() Histograms Single continuous
geom_boxplot() Boxplots Continuous by category
geom_smooth() Trend lines Two continuous

Combining Geoms

Combining Geoms Code

ggplot(data=mpg, aes(x=displ, y=hwy)) +
  geom_point() +
  geom_smooth() +
  labs(title = "Fuel efficiency decreases with engine size",
       caption = "Data from fueleconomy.gov")

Faceting

Faceting Code

ggplot(data=mpg) +
  geom_point(mapping=aes(x=displ, y=hwy)) +
  facet_wrap(~class, nrow=2)

Three-Dimensional Data

Three-Dimensional Data Code

set.seed(345)
d <- data.frame(a = rnorm(100, 10, 10), b = rnorm(100, 5, 5))
ggplot(d, aes(x=a, y=b)) +
  geom_density2d_filled() +
  theme_minimal()

Choosing the Right Plot

R Exercise: Data Visualization

Exercise
  1. Read in Week1b_Stickle_RNAseq.tsv or knee_injury.csv dataset
  2. Create histograms of data
  3. Create histograms on a faceted figure for different levels of a category
  4. Repeat steps 2 and 3 but create boxplots
  5. Write plots to a .pdf file

Best Practices in Data Visualization

Good, Bad, and Ugly

A ticker tape parade of confusion

A line to no understanding

Whack-a-mole

Disk of disinformation

Steaming pie chart mess

A bake sale of pie charts

Minard’s map of Napoleon’s Russian campaign - Edward Tufte

Modern recreation of Minard-style data visualization

Principles of Effective Display

“Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”

— Edward Tufte

  • Show the data
  • Encourage the eye to compare differences
  • Represent magnitudes honestly and accurately
  • Draw graphical elements clearly, minimizing clutter
  • Make displays easy to interpret

“Above All Else Show the Data”


“Maximize the Data to Ink Ratio”

The Economist 2006

Represent Magnitudes Honestly

Rattenborg et al. 1999 Nature

How NOT to Make a Figure

“Graphical excellence begins with telling the truth about the data” – Tufte 1983

Parameters and Statistics

Population vs. Sample

Concept Population Sample
Definition All individuals Subset of population
Parameters μ (mean), σ (SD) x̄ (mean), s (SD)
Goal What we want to know What we can measure
Important

We use sample statistics to estimate population parameters

Accuracy vs. Precision

  • Accuracy: closeness to true value
  • Precision: closeness of repeated estimates to each other
  • Replication quantifies variation
  • Randomization avoids bias

Stochastic Processes in Statistics

  • We often want to know truths about the world, but can only estimate them
  • Uncertainty in those estimates is a given
  • Statistics is largely about quantifying and managing uncertainty
  • Random variables are products of stochastic processes
  • Expectations are based on theoretical probability distributions

Two Interpretations of Probability

Frequency interpretation:

“Probabilities are mathematically convenient approximations to long run relative frequencies.”

Subjective (Bayesian) interpretation:

“A probability statement expresses the opinion of some individual regarding how certain an event is to occur.”

Random Variables and Probability

  • Probability is the expression of belief in some future outcome
  • A random variable can take on different values with different probabilities
  • The sample space is the universe of all possible values
  • Sample space represented by:
    • Probability mass function (discrete)
    • Probability density function (continuous)
Important

Probabilities of a sample space always sum to 1.0

Probability Rules

‘And’ rule (multiplication): \[Pr(X \text{ and } Y) = Pr(X) \times Pr(Y)\]

‘Or’ rule (addition, for mutually exclusive events): \[Pr(X \text{ or } Y) = Pr(X) + Pr(Y)\]

Note

The ‘and’ rule assumes independent events, and the ‘or’ rule assumes mutually exclusive events. For non-independent events we need conditional probabilities; for overlapping events we subtract \(Pr(X \text{ and } Y)\).
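
A quick sanity check of both rules with two independent fair dice (a sketch in base R):

```r
# Exact probabilities for two independent fair dice
p_six <- 1 / 6

# 'And' rule (independence): both dice show a six
p_both_six <- p_six * p_six               # 1/36

# 'Or' rule (mutually exclusive): a single die shows a one OR a two
p_one_or_two <- 1/6 + 1/6                 # 1/3

# General 'or' rule (events can overlap): at least one die shows a six
p_any_six <- p_six + p_six - p_both_six   # 11/36

c(p_both_six, p_one_or_two, p_any_six)
```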

Joint and Conditional Probability

Joint probability (independent events): \[Pr(X,Y) = Pr(X) \times Pr(Y)\]

Conditional probability (independent): \[Pr(Y|X) = Pr(Y) \text{ and } Pr(X|Y) = Pr(X)\]

Conditional probability (non-independent): \[Pr(Y|X) \neq Pr(Y) \text{ and } Pr(X|Y) \neq Pr(X)\]
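
A conditional probability is a joint probability divided by a marginal: \(Pr(Y|X) = Pr(X,Y)/Pr(X)\). A sketch with a hypothetical 2×2 joint distribution of disease status (D) and a diagnostic test result (T) — the numbers are invented for illustration:

```r
# Hypothetical joint distribution: disease status (rows) x test result (columns)
joint <- matrix(c(0.08, 0.02,    # D+: test+, test-
                  0.09, 0.81),   # D-: test+, test-
                nrow = 2, byrow = TRUE,
                dimnames = list(c("D+", "D-"), c("T+", "T-")))

pr_T_pos <- sum(joint[, "T+"])                    # marginal Pr(T+) = 0.17
pr_D_given_Tpos <- joint["D+", "T+"] / pr_T_pos   # Pr(D+ | T+) = joint / marginal

# Pr(D+ | T+) differs from the marginal Pr(D+) = 0.10, so D and T
# are non-independent
pr_D_given_Tpos
```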

Likelihood vs. Probability

  • Probability: proportion of times an event would occur over many trials
  • Likelihood: conditional probability of a parameter value given data

\[L[\text{parameter}|\text{data}] = Pr[\text{data}|\text{parameter}]\]

  • Maximum likelihood: highest value of the likelihood function
  • Bayesian estimate: uses prior distribution to update posterior distribution
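
A minimal maximum-likelihood sketch: evaluate \(L[p|\text{data}] = Pr[\text{data}|p]\) over a grid of candidate values and take the peak. For 7 successes in 10 Bernoulli trials, the MLE is analytically \(k/n = 0.7\):

```r
# Observed data: 7 successes in 10 trials
k <- 7; n <- 10

# Evaluate the likelihood L(p | data) = Pr(data | p) over a grid of p values
p_grid <- seq(0.01, 0.99, by = 0.01)
lik <- dbinom(k, size = n, prob = p_grid)

# Maximum likelihood estimate: the p with the highest likelihood
p_hat <- p_grid[which.max(lik)]
p_hat   # equals k/n = 0.7
```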

Moments of Distributions

1st moment (mean/expectation): \[E[X] = \sum_{\text{all x}}xP(X=x) = \mu\]

2nd moment (variance): \[Var(X) = E[(X-\mu)^2] = \sigma^2\]

Standard deviation: \[SD = \sqrt{\sigma^2} = \sigma\]

Higher moments include skewness (3rd) and kurtosis (4th).
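
These moments can be checked by simulation. A sketch using a Poisson population, whose mean and variance both equal \(\lambda\):

```r
# Simulate from Poisson(lambda = 4): mu = sigma^2 = 4
set.seed(1)
x <- rpois(1e5, lambda = 4)

mean(x)   # 1st moment: close to mu = 4
var(x)    # 2nd central moment: close to sigma^2 = 4

# 3rd standardized moment (skewness), computed by hand;
# for a Poisson it is 1/sqrt(lambda) = 0.5
mean(((x - mean(x)) / sd(x))^3)
```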

Discrete Probability Distributions

PMF, PDF, and CDF in R

Figure 3: Comparing discrete (Poisson) and continuous (Normal) distributions (probability mass function, probability density function, and cumulative distribution function)

PMF, PDF, and CDF Code

par(mfrow = c(2, 2))

# Discrete (Poisson) PMF
x_disc <- 0:15
plot(x_disc, dpois(x_disc, lambda = 5), type = "h", lwd = 3, col = "steelblue",
     main = "PMF: Poisson(λ=5)", xlab = "x", ylab = "P(X = x)")
points(x_disc, dpois(x_disc, lambda = 5), pch = 19, col = "steelblue")

# Continuous (Normal) PDF
x_cont <- seq(-4, 4, length.out = 200)
plot(x_cont, dnorm(x_cont), type = "l", lwd = 2, col = "coral",
     main = "PDF: Normal(0, 1)", xlab = "x", ylab = "f(x)")

# Poisson CDF
plot(x_disc, ppois(x_disc, lambda = 5), type = "s", lwd = 2, col = "steelblue",
     main = "CDF: Poisson(λ=5)", xlab = "x", ylab = "F(x)")

# Normal CDF
plot(x_cont, pnorm(x_cont), type = "l", lwd = 2, col = "forestgreen",
     main = "CDF: Normal(0, 1)", xlab = "x", ylab = "F(x) = P(X ≤ x)")
abline(h = c(0.025, 0.975), lty = 2, col = "gray")

Common moments of distributions

Common Probability Distributions

A Bernoulli Trial

Probability of Independent Events

Bernoulli Distribution

Describes the expected outcome of a single event with probability \(p\).

Example: Flipping a fair coin once

\[Pr(X=\text{Head}) = \frac{1}{2} = 0.5 = p\]

\[Pr(X=\text{Tails}) = \frac{1}{2} = 0.5 = 1 - p = q\]

Probabilities always sum to 1: \(p + q = 1\)

Let’s Simulate Coin Flips

Coin Flip Code

coin <- c("heads", "tails")
flips <- sample(coin, prob = c(0.5, 0.5), size = 100, replace = TRUE)
barplot(table(flips), col = c("steelblue", "coral"))

In-Class Demo: Law of Large Numbers with Coin Flips

Watch how the proportion of heads converges to 0.5 as we flip more coins:

LLN Coin Flips Code

set.seed(344)
par(mfrow = c(2, 3))

# Different sample sizes
for(n in c(1, 8, 16, 32, 400, 1000)) {
  flips <- rbinom(n = n, size = 1, prob = 0.5)  # 0 = heads, 1 = tails
  prop_heads <- sum(flips == 0) / length(flips)
  prop_tails <- 1 - prop_heads

  barplot(height = c(prop_heads, prop_tails),
          names.arg = c("heads", "tails"),
          col = c("steelblue", "coral"),
          ylim = c(0, 1),
          main = paste("n =", n, "flips"),
          ylab = "Proportion")
  abline(h = 0.5, lty = 2, col = "gray")
}
Tip

As sample size increases, observed proportions converge to true probability (0.5).

Law of Large Numbers in R

# Simulate cumulative means from a Normal(10, 2) population
set.seed(123)
n <- 1000
samples <- rnorm(n, mean = 10, sd = 2)
cumulative_means <- cumsum(samples) / 1:n

plot(1:n, cumulative_means, type = "l", col = "steelblue", lwd = 2,
     xlab = "Sample Size (n)", ylab = "Cumulative Mean",
     main = "Law of Large Numbers: Mean Converges to μ = 10")
abline(h = 10, col = "red", lwd = 2, lty = 2)
legend("topright", c("Cumulative Mean", "True Mean (μ = 10)"),
       col = c("steelblue", "red"), lwd = 2, lty = c(1, 2))
Figure 4: Sample means converge to population mean as sample size increases

Binomial Distribution

Results from combining several independent Bernoulli events.

\[\large f(k) = {n \choose k} p^{k} (1-p)^{n-k}\]

Where:

  • \(n\) = total number of trials
  • \(k\) = number of successes
  • \(p\) = probability of success

Binomial Distribution

Creating a Binomial Distribution

Binomial Distribution Code

# Probability of exactly 5 successes in 10 trials
dbinom(x = 5, size = 10, prob = 0.5)

# Plot distribution
plot(0:10, dbinom(x = 0:10, size = 10, prob = 0.5),
     type = "h", lwd = 3, col = "steelblue",
     xlab = "Number of Successes", ylab = "Probability")

Poisson Distribution

For discrete counts (e.g., snails per plot, neuron firings per second).

\[Pr(Y=r) = \frac{e^{-\lambda}\lambda^r}{r!}\]

Note

The Poisson is a single-parameter distribution: \(\mu = \sigma^2 = \lambda\)

Variables with variance > mean are called “overdispersed” (common in RNA-seq data).

Poisson Distribution Examples

Gene length by 500 nt bins

Increasing λ values

Creating Poisson Distributions

Poisson Distribution Code

# Probability of 2 counts given lambda = 1
dpois(x = 2, lambda = 1)

# Plot distribution
plot(0:15, dpois(x = 0:15, lambda = 3),
     type = "h", lwd = 3, col = "darkgreen",
     xlab = "Count", ylab = "Probability")

Geometric Distribution

Probability that the first success occurs on trial \(k\):

\[P(X=k)=(1-p)^{k-1}p\]

  • Mean = \(\frac{1}{p}\)
  • Variance = \(\frac{(1-p)}{p^2}\)

Example: If extinction probability is 0.1 per year, expected time to extinction?
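
With \(p = 0.1\) per year, the expected time to extinction is \(1/p = 10\) years. Note that R's *geom functions count failures before the first success, so trial number \(k\) corresponds to x = k - 1:

```r
p <- 0.1

# Probability extinction happens in exactly year 3: (1-p)^(3-1) * p
# R's dgeom counts failures before the first success, so use x = 3 - 1
dgeom(x = 3 - 1, prob = p)   # 0.9^2 * 0.1 = 0.081

# Expected waiting time (in trials): 1/p
1 / p                         # 10 years
```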

Negative Binomial Distribution

Probability of the \(r^{th}\) success on the \(k^{th}\) trial:

\[P(X=k)=\binom{k-1}{r-1}p^{r}(1-p)^{k-r}\]

  • Mean = \(\frac{r}{p}\)
  • Variance = \(\frac{r(1-p)}{p^2}\)
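
The formula above can be evaluated directly or via dnbinom(), which (like dgeom) parameterizes by the number of failures \(k - r\) before the \(r^{th}\) success:

```r
# Probability the 3rd success (r) lands on the 7th trial (k), with p = 0.5
r <- 3; k <- 7; p <- 0.5

# Direct evaluation of the formula above
choose(k - 1, r - 1) * p^r * (1 - p)^(k - r)

# Same value via dnbinom(), which counts the k - r failures
dnbinom(x = k - r, size = r, prob = p)
```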

Continuous Probability Distributions

Continuous Probability Distributions

\[P(a\leq X \leq b) = \int_{a}^{b} f(x) dx\]

Integrated over the whole real line, the density equals one:

\[\int_{-\infty}^{\infty} f(x) dx = 1\]

Expectation: \[E[X] = \int_{-\infty}^{\infty} xf(x) dx\]
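
Both integrals can be checked numerically with R's integrate() — a sketch using a Normal(10, 2) density:

```r
# Total probability: integral of f(x) over the whole real line
total <- integrate(function(x) dnorm(x, mean = 10, sd = 2), -Inf, Inf)$value

# Expectation: integral of x * f(x)
ex <- integrate(function(x) x * dnorm(x, mean = 10, sd = 2), -Inf, Inf)$value

c(total, ex)   # approximately 1 and 10
```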

Uniform Distribution

All outcomes equally probable.

\[E[X] = \frac{(a+b)}{2}\]

Uniform Distribution Code

x <- seq(0, 10, 0.1)
plot(x, dunif(x, 0, 10), type = "l", lwd = 2, col = "purple",
     ylab = "Density", main = "Uniform Distribution (0, 10)")

Exponential Distribution

\[f(x)=\lambda e^{-\lambda x}\]

  • \(E[X] = \frac{1}{\lambda}\)
  • \(Var(X) = \frac{1}{\lambda^2}\)

Example: If λ equals the instantaneous death rate, the lifespan follows an exponential distribution.

Exponential Distribution

Exponential Distribution Code

x <- seq(0, 50, 0.5)
plot(x, dexp(x, rate = 0.1), type = "l", lwd = 2, col = "red",
     ylab = "Density", main = "Exponential Distribution (λ = 0.1)")

Gamma Distribution

Waiting time until the \(r^{th}\) event at rate \(\lambda\):

\[f(x) = \frac{\lambda^r x^{r-1} e^{-\lambda x}}{(r-1)!}\]

  • Mean = \(\frac{r}{\lambda}\)
  • Variance = \(\frac{r}{\lambda^2}\)

Example: Time until 1000 DNA strands synthesized at rate 1/ms.
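
The DNA example can be sketched with R's gamma functions (shape \(r = 1000\), rate \(\lambda = 1\) per ms):

```r
# Waiting time for r = 1000 synthesis events at rate lambda = 1 per ms
r <- 1000; lambda <- 1

r / lambda                                # mean waiting time: 1000 ms
pgamma(1050, shape = r, rate = lambda)    # Pr(all 1000 done within 1050 ms)
qgamma(0.95, shape = r, rate = lambda)    # 95th percentile of the waiting time
```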

The Normal or Gaussian Distribution

Normal (Gaussian) Distribution

\[f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, \mathrm{e}^{-\frac{(x - \mu)^2}{2\sigma^2}}\]

Notation: \(v \sim \mathcal{N}(\mu, \sigma^2)\)

Normal Distribution


Why is the Normal Distribution Special?

Why Normal is Special in Biology


Z-Scores of Normal distributions

Standardize variables to mean = 0, SD = 1:

\[\huge z_i = \frac{(x_i - \bar{x})}{s}\]

This is the standard normal distribution.
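
A sketch of standardization in R (the built-in scale() does the same thing):

```r
set.seed(7)
x <- rnorm(50, mean = 100, sd = 15)

# Standardize by hand: subtract the sample mean, divide by the sample SD
z <- (x - mean(x)) / sd(x)
# equivalently: z <- as.numeric(scale(x))

c(mean(z), sd(z))   # 0 and 1, up to floating-point error
```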

Sampling from distributions in R

Distribution Functions in R

Prefix Function Example
d Probability density/mass dnorm(0, 0, 1)
p Cumulative distribution pnorm(1.96, 0, 1)
q Quantile function qnorm(0.975, 0, 1)
r Random number generator rnorm(100, 0, 1)

Works for: binom, pois, exp, norm, geom, nbinom, unif, gamma
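
The four prefixes are mutually consistent; for example, q inverts p, and summing the PMF reproduces the CDF:

```r
qnorm(pnorm(1.96))          # q inverts p: returns 1.96
pbinom(5, size = 10, prob = 0.5)    # Pr(X <= 5)
sum(dbinom(0:5, size = 10, prob = 0.5))  # same value, summing the PMF
mean(rnorm(1e5, mean = 0, sd = 1))  # random draws: sample mean near 0
```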

Estimates and Confidence Intervals

Parameter Estimation

  • Estimation infers population parameters from sample data
  • Sample estimates rarely equal population parameters exactly
  • Sampling distribution: all values we might obtain from samples
  • Standard error: standard deviation of sampling distribution
Important

NO ESTIMATE IS USEFUL WITHOUT A STANDARD ERROR!

Estimation Approaches

Approach Description
Parametric Assumes specific distributions
Resampling Bootstrap/randomization for empirical distributions
OLS Ordinary Least Squares optimization
Maximum Likelihood Model-based estimates with confidence
Bayesian Incorporates prior information

The Central Limit Theorem

For most data, we can’t determine sampling distributions empirically.

The CLT tells us that the sampling distribution of the mean approaches normal as sample size increases, regardless of the underlying distribution.

Visualizing the Central Limit Theorem

Figure 5: CLT: Sample means become normal even from non-normal populations

Visualizing the Central Limit Theorem

par(mfrow = c(2, 4))

# Exponential distribution (skewed)
set.seed(42)
exp_pop <- rexp(10000, rate = 1)
hist(exp_pop, main = "Population\n(Exponential)", col = "lightgray",
     border = "white", xlab = "Value", breaks = 30)

# Sample means for different n
for(n in c(5, 30, 100)) {
  means <- replicate(1000, mean(sample(exp_pop, n, replace = TRUE)))
  hist(means, main = paste("Sample Means\n(n =", n, ")"),
       col = "steelblue", border = "white", xlab = "Mean", breaks = 25)
}

# Uniform distribution
unif_pop <- runif(10000, 0, 1)
hist(unif_pop, main = "Population\n(Uniform)", col = "lightgray",
     border = "white", xlab = "Value", breaks = 30)

for(n in c(5, 30, 100)) {
  means <- replicate(1000, mean(sample(unif_pop, n, replace = TRUE)))
  hist(means, main = paste("Sample Means\n(n =", n, ")"),
       col = "coral", border = "white", xlab = "Mean", breaks = 25)
}

CLT in Quantitative Genetics

Why are complex traits normally distributed?

Simulate a trait controlled by 5 genes, each with additive effects:

CLT Quantitative Genetics Code

set.seed(245)
n_individuals <- 100

# Each locus contributes -1, 0, or +1 to phenotype (diploid genotypes)
contributions <- matrix(0, nrow = n_individuals, ncol = 5)
for(locus in 1:5) {
  genotype <- rbinom(n_individuals, size = 2, prob = 0.5)  # 0, 1, or 2 copies
  contributions[, locus] <- ifelse(genotype == 2, 1,
                                    ifelse(genotype == 1, 0, -1))
}

# Sum across loci to get phenotype
phenotype <- rowSums(contributions)

par(mfrow = c(1, 2))
hist(phenotype, col = "steelblue", border = "white",
     main = "5-Locus Trait Distribution",
     xlab = "Phenotype Value", breaks = seq(-6, 6, 1))

# Add environmental noise to make continuous
phenotype_continuous <- phenotype + rnorm(n_individuals, 0, 0.5)
hist(phenotype_continuous, col = "coral", border = "white",
     main = "With Environmental Variation",
     xlab = "Phenotype Value", breaks = 20)
Note

This is why quantitative traits (height, weight, blood pressure) are approximately normal!

Sampling variation of a parameter estimate

Sampling Variation of a Parameter

Estimation and Confidence Intervals

Standard Error of the Mean

\[\huge \sigma_{\bar{x}} \approx s_{\bar{x}} = \frac{s}{\sqrt{n}}\]

Note
  • SEM is NOT the standard deviation of the original distribution
  • SEM decreases as sample size increases

Calculating SEM

set.seed(32)
true_pop <- rnorm(n = 1000, mean = 2, sd = 5)

# The SD of replicate sample means is itself an empirical estimate of the SEM
# Small samples (n = 5 each, 50 replicate samples)
samps_5 <- replicate(n = 50, sample(true_pop, size = 5))
sem_5 <- sd(apply(samps_5, 2, mean))    # close to s / sqrt(5)

# Larger samples (n = 50 each)
samps_50 <- replicate(n = 50, sample(true_pop, size = 50))
sem_50 <- sd(apply(samps_50, 2, mean))  # close to s / sqrt(50)

cat("SEM (n=5):", round(sem_5, 4), "\nSEM (n=50):", round(sem_50, 4))

Confidence Intervals

A confidence interval is a range of values computed from a sample such that, if we repeated the sampling many times, X% of the intervals constructed this way would contain the true population parameter.

# 95% CI using t.test
sample_data <- rnorm(30, mean = 10, sd = 2)
t.test(sample_data)$conf.int
[1]  9.177721 10.707722
attr(,"conf.level")
[1] 0.95

Visualizing What “95% Confidence” Means

Figure 6: 95% of CIs from repeated sampling contain the true mean (μ = 10)

Visualizing What “95% Confidence” Means

true_mean <- 10
n_samples <- 50

# Take 50 samples and calculate CIs
results <- t(replicate(n_samples, {
  samp <- rnorm(20, mean = true_mean, sd = 2)
  ci <- t.test(samp)$conf.int
  c(mean = mean(samp), lower = ci[1], upper = ci[2])
}))

# Plot CIs - color red if they miss the true mean
contains_true <- results[, "lower"] <= true_mean & results[, "upper"] >= true_mean
colors <- ifelse(contains_true, "steelblue", "red")

plot(NULL, xlim = c(8, 12), ylim = c(0, n_samples + 1),
     xlab = "Value", ylab = "Sample Number", main = "50 Confidence Intervals")
abline(v = true_mean, col = "darkgreen", lwd = 2)
for(i in 1:n_samples) {
  segments(results[i, "lower"], i, results[i, "upper"], i, col = colors[i], lwd = 2)
  points(results[i, "mean"], i, pch = 19, cex = 0.6, col = colors[i])
}
legend("topright", c(paste(sum(contains_true), "contain μ"),
       paste(sum(!contains_true), "miss μ")), col = c("steelblue", "red"), lwd = 2)

Coefficient of Variation

To compare standard deviations across populations with different means:

\[CV = \frac{s}{\bar{x}} \times 100\%\]
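
A sketch with invented numbers: body mass of mice and elephants differs by orders of magnitude, but the CV puts their variability on a common scale:

```r
# Hypothetical means and SDs, chosen only for illustration
mouse_mean <- 20;   mouse_sd <- 4     # grams
eleph_mean <- 4500; eleph_sd <- 675   # kilograms

cv_mouse <- mouse_sd / mouse_mean * 100   # 20%
cv_eleph <- eleph_sd / eleph_mean * 100   # 15%

# Despite wildly different raw SDs, the relative variability is comparable
c(cv_mouse, cv_eleph)
```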